:orphan:

Sklearn Basics 2: Train a Classifier on a Star Multi-Table Dataset
==================================================================

In this notebook, we learn how to train a classifier on a multi-table
dataset composed of two tables (a root table and a secondary table). It is
highly recommended to complete the *Sklearn Basics 1* lesson first if you
are not familiar with Khiops' sklearn estimators.

We start by importing the sklearn estimator ``KhiopsClassifier``:

.. code:: ipython3

    import os
    import pandas as pd
    from khiops import core as kh
    from khiops.sklearn import KhiopsClassifier, train_test_split_dataset
    from sklearn import metrics

    # If there are any issues, you may check the Khiops status with the following command
    # kh.get_runner().print_status()

Training a Multi-Table Classifier
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We'll train a "sarcasm detector" using the dataset ``HeadlineSarcasm``. In
its raw form, the dataset contains a list of text headlines paired with a
label that indicates whether its source is a sarcastic site (such as *The
Onion*) or not.

We have transformed this dataset into two tables such that the text-label
record

::

   "groundbreaking study finds gratification can be deliberately postponed" yes

is transformed to an entry in a table that contains (id, label) records

::

   97 yes

and various entries in a secondary table linking a headline id to its words
and positions

::

   97 0 groundbreaking
   97 1 study
   97 2 finds
   97 3 gratification
   97 4 can
   97 5 be
   97 6 deliberately
   97 7 postponed

Thus, the ``HeadlineSarcasm`` dataset has the following multi-table schema

::

   +-----------+
   |Headline   |
   +-----------+       +-------------+
   |HeadlineId*|       |HeadlineWords|
   |IsSarcasm  |       +-------------+
   +-----------+       |HeadlineId*  |
        |              |Position     |
        +---1:n------->|Word         |
                       +-------------+

The ``HeadlineId`` variable is special because it is a *key* that links a
particular headline to its words (a ``1:n`` relation).

*Note: There are other methods more appropriate for this text-mining
problem. This multi-table setup is only intended for pedagogical purposes.*

To train the ``KhiopsClassifier`` for this setup, we must specify a
multi-table dataset. Let's first check the content of the created tables:

- The main table ``Headline``
- The secondary table ``HeadlineWords``

.. code:: ipython3

    sarcasm_dataset_dir = os.path.join("data", "HeadlineSarcasm")

    headlines_file = os.path.join(sarcasm_dataset_dir, "Headlines.txt")
    headlines_df = pd.read_csv(headlines_file, sep="\t")
    print("Headlines table (first 10 rows)")
    display(headlines_df.head(10))

    headlines_words_file = os.path.join(sarcasm_dataset_dir, "HeadlineWords.txt")
    headlines_words_df = pd.read_csv(headlines_words_file, sep="\t")
    print("HeadlineWords table (first 10 rows)")
    display(headlines_words_df.head(10))

.. parsed-literal::

    Headlines table (first 10 rows)

.. parsed-literal::

       HeadlineId IsSarcasm
    0           0       yes
    1           1        no
    2          10        no
    3         100       yes
    4        1000       yes
    5       10000        no
    6       10001       yes
    7       10002        no
    8       10003       yes
    9       10004        no

.. parsed-literal::

    HeadlineWords table (first 10 rows)

.. parsed-literal::

       HeadlineId  Position             Word
    0           0         0  thirtysomething
    1           0         1       scientists
    2           0         2           unveil
    3           0         3         doomsday
    4           0         4            clock
    5           0         5               of
    6           0         6             hair
    7           0         7             loss
    8           1         0              dem
    9           1         1             rep.

Before training the classifier, we split the main table into a feature
matrix (only the ``HeadlineId`` column) and a target vector containing the
labels (the ``IsSarcasm`` column).

.. code:: ipython3

    headlines_main_df = headlines_df.drop("IsSarcasm", axis=1)
    y_sarcasm = headlines_df["IsSarcasm"]

You may note that the feature matrix does not contain any *feature*, but do
not worry: the Khiops AutoML engine will automatically create features by
aggregating the columns of ``HeadlineWords`` for each headline (more
details about this below).

Moreover, instead of passing an ``X`` table to the ``fit`` method, we pass
a *multi-table dataset* specification, which is a dictionary with the
following format:

::

   X = {
       "main_table": <main table name>,
       "tables": {
           <table name>: (<dataframe>, <key columns>),
           <table name>: (<dataframe>, <key columns>),
           ...
       }
   }

Note that the key columns of each table are specified as a single name or a
tuple containing the column names composing the key. So for our
``HeadlineSarcasm`` case, we specify the dataset as:

.. code:: ipython3

    X_sarcasm = {
        "main_table": "headlines",
        "tables": {
            "headlines": (headlines_main_df, "HeadlineId"),
            "headline_words": (headlines_words_df, "HeadlineId"),
        },
    }

To separate this dataset into train and test, we use the ``khiops-python``
helper function ``train_test_split_dataset``. This function can split
``dict`` dataset specifications:

.. code:: ipython3

    (
        X_sarcasm_train,
        X_sarcasm_test,
        y_sarcasm_train,
        y_sarcasm_test,
    ) = train_test_split_dataset(X_sarcasm, y_sarcasm)

The call to the ``KhiopsClassifier`` ``fit`` method is very similar to the
single-table case, but this time we specify the additional parameter
``n_features``, which is the number of aggregates that the Khiops AutoML
engine will construct and analyze during the training. Some examples of the
features it will create for ``HeadlineSarcasm`` are:

- Number of different words in the headline
- Most common word in the headline
- Number of times the word 'the' appears
- …

The Khiops AutoML engine will also evaluate, select and combine these
features to build a classifier. Here we request ``1000`` features (the
default is ``100``):

*Note: By default Khiops builds 10 decision tree features. This is not
necessary for this tutorial, so we set* ``n_trees=0``.

.. code:: ipython3

    khc_sarcasm = KhiopsClassifier(n_features=1000, n_trees=0)
    khc_sarcasm.fit(X_sarcasm_train, y_sarcasm_train)

.. parsed-literal::

    KhiopsClassifier(n_features=1000, n_trees=0)
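To build some intuition for the aggregates the engine constructs, here is a
minimal pandas sketch of one such feature, the number of different words in
a headline. The toy table below is a hypothetical stand-in for
``HeadlineWords``, not part of the tutorial's pipeline:

.. code:: ipython3

    import pandas as pd

    # Toy stand-in for the HeadlineWords secondary table
    toy_words_df = pd.DataFrame(
        {
            "HeadlineId": [97, 97, 97, 98, 98],
            "Word": ["groundbreaking", "study", "study", "area", "man"],
        }
    )

    # One aggregate Khiops could construct: the distinct word count per headline,
    # which turns the 1:n secondary table into one value per main-table row
    distinct_words = toy_words_df.groupby("HeadlineId")["Word"].nunique()
    print(distinct_words.to_dict())  # {97: 2, 98: 2}

Khiops builds and scores many such aggregates automatically; this sketch
only illustrates the shape of the computation.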
We quickly check its train accuracy and AUC, as in the previous tutorial:

.. code:: ipython3

    sarcasm_train_performance = (
        khc_sarcasm.model_report_.train_evaluation_report.get_snb_performance()
    )
    print(f"HeadlineSarcasm train accuracy: {sarcasm_train_performance.accuracy}")
    print(f"HeadlineSarcasm train auc     : {sarcasm_train_performance.auc}")

.. parsed-literal::

    HeadlineSarcasm train accuracy: 0.850867
    HeadlineSarcasm train auc     : 0.933792

Now, we use our sarcasm classifier to obtain predictions and probabilities
on the test data:

.. code:: ipython3

    y_sarcasm_test_predicted = khc_sarcasm.predict(X_sarcasm_test)
    probas_sarcasm_test = khc_sarcasm.predict_proba(X_sarcasm_test)

    print("HeadlineSarcasm test predictions (first 10 values):")
    display(y_sarcasm_test_predicted[:10])
    print("HeadlineSarcasm test prediction probabilities (first 10 values):")
    display(probas_sarcasm_test[:10])

.. parsed-literal::

    HeadlineSarcasm test predictions (first 10 values):

.. parsed-literal::

    array(['no', 'no', 'yes', 'yes', 'yes', 'no', 'no', 'no', 'no', 'no'],
          dtype=object)

.. parsed-literal::

    HeadlineSarcasm test prediction probabilities (first 10 values):

.. parsed-literal::

    array([[0.98051026, 0.01948974],
           [0.7168483 , 0.2831517 ],
           [0.48756231, 0.51243769],
           [0.08162827, 0.91837173],
           [0.30038081, 0.69961919],
           [0.8818798 , 0.1181202 ],
           [0.87340021, 0.12659979],
           [0.74002932, 0.25997068],
           [0.96795465, 0.03204535],
           [0.75040413, 0.24959587]])

Finally, we estimate the accuracy and AUC on the test data:

.. code:: ipython3

    sarcasm_test_accuracy = metrics.accuracy_score(y_sarcasm_test, y_sarcasm_test_predicted)
    sarcasm_test_auc = metrics.roc_auc_score(y_sarcasm_test, probas_sarcasm_test[:, 1])

    print(f"Sarcasm test accuracy: {sarcasm_test_accuracy}")
    print(f"Sarcasm test auc     : {sarcasm_test_auc}")

.. parsed-literal::

    Sarcasm test accuracy: 0.8251572327044026
    Sarcasm test auc     : 0.9083383942327357

To further explore the results, we can open the report with the Khiops
Visualization app:

.. code:: ipython3

    # To visualize, uncomment the lines below
    # khc_sarcasm.export_report_file("./sarcasm_report.khj")
    # kh.visualize_report("./sarcasm_report.khj")

Exercise
~~~~~~~~

Repeat the previous steps with the ``AccidentsSummary`` dataset. This
dataset describes the characteristics of traffic accidents that happened in
France in 2018. It has two tables with the following schema:

::

   +---------------+
   |Accidents      |
   +---------------+
   |AccidentId*    |
   |Gravity        |
   |Date           |
   |Hour           |        +---------------+
   |Light          |        |Vehicles       |
   |Department     |        +---------------+
   |Commune        |        |AccidentId*    |
   |InAgglomeration|        |VehicleId*     |
   |...            |        |Direction      |
   +---------------+        |Category       |
        |                   |PassengerNumber|
        +---1:n------------>|...            |
                            +---------------+

For each accident, we have both its characteristics (such as ``Gravity`` or
``Light`` conditions) and those of each involved vehicle (its ``Direction``
or ``PassengerNumber``).

We first load the tables of ``AccidentsSummary`` into dataframes:

.. code:: ipython3

    accidents_dataset_dir = os.path.join(kh.get_samples_dir(), "AccidentsSummary")

    accidents_file = os.path.join(accidents_dataset_dir, "Accidents.txt")
    accidents_df = pd.read_csv(accidents_file, sep="\t", encoding="latin1")
    print("Accidents dataframe (first 10 rows):")
    display(accidents_df.head(10))
    print()

    vehicles_file = os.path.join(accidents_dataset_dir, "Vehicles.txt")
    vehicles_df = pd.read_csv(vehicles_file, sep="\t", encoding="latin1")
    print("Vehicles dataframe (first 10 rows):")
    display(vehicles_df.head(10))

.. parsed-literal::

    Accidents dataframe (first 10 rows):

.. parsed-literal::

         AccidentId    Gravity        Date      Hour               Light  \
    0  201800000001  NonLethal  2018-01-24  15:05:00            Daylight
    1  201800000002  NonLethal  2018-02-12  10:15:00            Daylight
    2  201800000003  NonLethal  2018-03-04  11:35:00            Daylight
    3  201800000004  NonLethal  2018-05-05  17:35:00            Daylight
    4  201800000005  NonLethal  2018-06-26  16:05:00            Daylight
    5  201800000006  NonLethal  2018-09-23  06:30:00      TwilightOrDawn
    6  201800000007  NonLethal  2018-09-26  00:40:00  NightStreelightsOn
    7  201800000008     Lethal  2018-11-30  17:15:00  NightStreelightsOn
    8  201800000009  NonLethal  2018-02-18  15:57:00            Daylight
    9  201800000010  NonLethal  2018-03-19  15:30:00            Daylight

       Department  Commune InAgglomeration IntersectionType    Weather  \
    0         590        5              No           Y-type     Normal
    1         590       11             Yes           Square   VeryGood
    2         590      477             Yes           T-type     Normal
    3         590       52             Yes   NoIntersection   VeryGood
    4         590      477             Yes   NoIntersection     Normal
    5         590       52             Yes   NoIntersection  LightRain
    6         590      133             Yes   NoIntersection     Normal
    7         590       11             Yes   NoIntersection     Normal
    8         590      550              No   NoIntersection     Normal
    9         590       51             Yes           X-type     Normal

                          CollisionType             PostalAddress
    0  2Vehicles-BehindVehicles-Frontal    route des Ansereuilles
    1                       NoCollision  Place du général de Gaul
    2                       NoCollision             Rue nationale
    3                    2Vehicles-Side       30 rue Jules Guesde
    4                    2Vehicles-Side        72 rue Victor Hugo
    5                             Other                       D39
    6                             Other        4 route de camphin
    7                             Other         rue saint exupéry
    8                             Other          rue de l'égalité
    9  2Vehicles-BehindVehicles-Frontal   face au 59 rue de Lille

.. parsed-literal::

    Vehicles dataframe (first 10 rows):

.. parsed-literal::

         AccidentId VehicleId Direction          Category  PassengerNumber  \
    0  201800000001       A01   Unknown         Car<=3.5T                0
    1  201800000001       B01   Unknown         Car<=3.5T                0
    2  201800000002       A01   Unknown         Car<=3.5T                0
    3  201800000003       A01   Unknown  Motorbike>125cm3                0
    4  201800000003       B01   Unknown         Car<=3.5T                0
    5  201800000003       C01   Unknown         Car<=3.5T                0
    6  201800000004       A01   Unknown         Car<=3.5T                0
    7  201800000004       B01   Unknown           Bicycle                0
    8  201800000005       A01   Unknown             Moped                0
    9  201800000005       B01   Unknown         Car<=3.5T                0

           FixedObstacle MobileObstacle ImpactPoint           Maneuver
    0                NaN        Vehicle  RightFront         TurnToLeft
    1                NaN        Vehicle   LeftFront  NoDirectionChange
    2                NaN     Pedestrian         NaN  NoDirectionChange
    3  StationaryVehicle        Vehicle       Front  NoDirectionChange
    4                NaN        Vehicle    LeftSide         TurnToLeft
    5                NaN            NaN   RightSide             Parked
    6                NaN          Other  RightFront          Avoidance
    7                NaN        Vehicle    LeftSide                NaN
    8                NaN        Vehicle  RightFront           PassLeft
    9                NaN        Vehicle   LeftFront               Park

Create the main feature matrix and the target vector for ``AccidentsSummary``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that the target variable is ``Gravity``.

.. code:: ipython3

    accidents_main_df = accidents_df.drop("Gravity", axis=1)
    y_accidents = accidents_df["Gravity"]

Create the multi-table dataset specification
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Note that the main table has a single key, ``AccidentId``, whereas the
secondary table has a composite key, ``AccidentId`` and ``VehicleId``.

.. code:: ipython3

    X_accidents = {
        "main_table": "accidents",
        "tables": {
            "accidents": (accidents_main_df, "AccidentId"),
            "vehicles": (vehicles_df, ["AccidentId", "VehicleId"]),
        },
    }

Split the dataset into train and test
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    (
        X_accidents_train,
        X_accidents_test,
        y_accidents_train,
        y_accidents_test,
    ) = train_test_split_dataset(X_accidents, y_accidents)

Train a classifier with this dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- You may choose the number of features ``n_features`` to be created by the
  Khiops AutoML engine
- Set the number of trees to zero (``n_trees=0``)

.. code:: ipython3

    khc_accidents = KhiopsClassifier(n_trees=0, n_features=1000)
    khc_accidents.fit(X_accidents_train, y_accidents_train)

.. parsed-literal::

    KhiopsClassifier(n_features=1000, n_trees=0)
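Conceptually, splitting a multi-table dataset means splitting the rows of
the main table and then keeping only the secondary rows whose key belongs
to the selected main rows. The following is a minimal pandas sketch of that
idea on hypothetical toy tables, not a description of
``train_test_split_dataset``'s actual implementation:

.. code:: ipython3

    import pandas as pd

    # Toy main and secondary tables linked by the AccidentId key
    toy_main_df = pd.DataFrame({"AccidentId": [1, 2, 3, 4]})
    toy_vehicles_df = pd.DataFrame(
        {
            "AccidentId": [1, 1, 2, 3, 4],
            "VehicleId": ["A01", "B01", "A01", "A01", "A01"],
        }
    )

    # Take the first 3 main rows as "train", then filter the secondary
    # table so it only references accidents present in the train split
    train_main_df = toy_main_df.iloc[:3]
    train_vehicles_df = toy_vehicles_df[
        toy_vehicles_df["AccidentId"].isin(train_main_df["AccidentId"])
    ]
    print(len(train_main_df), len(train_vehicles_df))  # 3 4

This is why the helper receives the whole ``dict`` specification: it must
keep the main and secondary tables consistent across the split.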
Print the train accuracy and AUC of the model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    accidents_train_performance = (
        khc_accidents.model_report_.train_evaluation_report.get_snb_performance()
    )
    print(f"AccidentsSummary train accuracy: {accidents_train_performance.accuracy}")
    print(f"AccidentsSummary train auc     : {accidents_train_performance.auc}")

.. parsed-literal::

    AccidentsSummary train accuracy: 0.944343
    AccidentsSummary train auc     : 0.81777

Deploy the classifier to obtain predictions and probabilities on the test data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    y_accidents_test_predicted = khc_accidents.predict(X_accidents_test)
    probas_accidents_test = khc_accidents.predict_proba(X_accidents_test)

    print("Accidents test predictions (first 10 values):")
    display(y_accidents_test_predicted[:10])
    print("Accidents test prediction probabilities (first 10 values):")
    display(probas_accidents_test[:10])

.. parsed-literal::

    Accidents test predictions (first 10 values):

.. parsed-literal::

    array(['NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal',
           'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal', 'NonLethal'],
          dtype=object)

.. parsed-literal::

    Accidents test prediction probabilities (first 10 values):

.. parsed-literal::

    array([[0.194344  , 0.805656  ],
           [0.00707682, 0.99292318],
           [0.03085459, 0.96914541],
           [0.08640951, 0.91359049],
           [0.01865278, 0.98134722],
           [0.00681306, 0.99318694],
           [0.0062505 , 0.9937495 ],
           [0.17195874, 0.82804126],
           [0.02707476, 0.97292524],
           [0.01174233, 0.98825767]])

Obtain the accuracy and AUC on the test dataset
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    accidents_test_accuracy = metrics.accuracy_score(
        y_accidents_test, y_accidents_test_predicted
    )
    accidents_test_auc = metrics.roc_auc_score(
        y_accidents_test, probas_accidents_test[:, 1]
    )
    print(f"Accidents test accuracy: {accidents_test_accuracy}")
    print(f"Accidents test auc     : {accidents_test_auc}")

.. parsed-literal::

    Accidents test accuracy: 0.9472518344178319
    Accidents test auc     : 0.8079238757149434

Explore the report with the Khiops Visualization App
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. code:: ipython3

    # To visualize, uncomment the lines below
    # khc_accidents.export_report_file("./accidents_report.khj")
    # kh.visualize_report("./accidents_report.khj")
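Accuracy and AUC summarize performance in a single number; on an imbalanced
target such as ``Gravity`` (mostly ``NonLethal``), a confusion matrix shows
the per-class error breakdown. Here is a small self-contained sketch with
``sklearn.metrics`` on toy labels (hypothetical values, not the actual test
predictions above):

.. code:: ipython3

    from sklearn import metrics

    # Toy true/predicted labels mimicking the Accidents target values
    y_true = ["NonLethal", "NonLethal", "Lethal", "NonLethal", "Lethal"]
    y_pred = ["NonLethal", "NonLethal", "NonLethal", "NonLethal", "Lethal"]

    # Rows are true classes, columns predicted, in the order given by `labels`
    cm = metrics.confusion_matrix(y_true, y_pred, labels=["Lethal", "NonLethal"])
    print(cm)  # [[1 1]
               #  [0 3]]

The same call applied to ``y_accidents_test`` and
``y_accidents_test_predicted`` would show how many ``Lethal`` accidents the
classifier misses despite its high overall accuracy.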